Scour
🧠 LLM Inference
Topics: Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 28,869 posts in 57.3 ms
- A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints (🏗️ LLM Infrastructure · arxiv.org · 5d)
- The Inference Shift (🏗️ LLM Infrastructure · stratechery.com · 1d · Hacker News)
- Tracing tokens through Llama 3.1 8B inference on H100s (🏗️ LLM Infrastructure · krithik.xyz · 3d · Hacker News)
- Understanding KV Cache in LLMs and How It Affects Inference (💾 Prompt Caching · pub.towardsai.net · 3d)
- Reformulating the KV Cache Eviction Problem for Long-Context LLM Inference (💾 Prompt Caching · arxiv.org · 1d)
- Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs (🔬 RaBitQ · arxiv.org · 12h)
- A General Framework for Multimodal LLM-Based Multimedia Understanding in Large-Scale Recommendation Systems (🎨 Chroma · arxiv.org · 12h)
- LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models (🏗️ LLM Infrastructure · arxiv.org · 12h)
- MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference (📏 ANN Benchmarks · arxiv.org · 1d)
- Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI (📱 Edge AI Optimization · arxiv.org · 12h)
- Inference-Time Causal Probing in LLMs (🏗️ LLM Infrastructure · arxiv.org · 1d)
- Understanding Asynchronous Inference Methods for Vision-Language-Action Models (🛡️ AI Safety · arxiv.org · 12h)
- LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction (💨 Cache-Friendly Algorithms · arxiv.org · 1d)
- RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache (🔬 RaBitQ · arxiv.org · 12h)
- CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels (🏗️ LLM Infrastructure · arxiv.org · 5d)
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache (🗂️ Vector Indexes · arxiv.org · 1d)
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation (🏗️ LLM Infrastructure · arxiv.org · 1d)
- Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs (🏗️ LLM Infrastructure · arxiv.org · 5d)
- An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference (📦 Batch Embeddings · arxiv.org · 1d)
- LAWS: Learning from Actual Workloads Symbolically -- A Self-Certifying Parametrized Cache Architecture for Neural Inference, Robotics, and Edge Deployment (📦 Batch Embeddings · arxiv.org · 5d · Hacker News)